30990081 WO 





Rec'd PCT/PTO 14 FEB 2001



COMPUTER ARCHITECTURE CONTAINING PROCESSOR AND COPROCESSOR



FIELD OF INVENTION 


The invention relates to computer architectures involving a main processor and a 
coprocessor. 

DESCRIPTION OF PRIOR ART 



Microprocessor-based computer systems are typically based around a general purpose 
microprocessor as CPU. Such microprocessors are well adapted to handle a wide 
range of computational tasks, but they are inevitably not optimised for all tasks. 
Where tasks are computationally intense (such as media processing) then the CPU 
will frequently not be able to perform acceptably.

One of the possible approaches to this problem is to use coprocessors specifically
adapted to handle individual computationally difficult tasks. Such coprocessors are
termed ASICs (Application Specific Integrated Circuits). These are built for specific
computational tasks, and can thus be optimised for such tasks. They are however
inflexible both in use and in programming (as they are designed for a specific task
alone) and are typically slow to produce. Improved solutions can be found by
construction of flexible hardware which can be programmed with a configuration
particularly suited to a given computational task, such as FPGAs (Field
Programmable Gate Arrays). Further flexibility is achieved if such structures are not
only configurable, but reconfigurable. An example of such a reconfigurable structure
is the CHESS array, discussed in International Patent Application No. GB98/00262,
International Patent Application No. GB98/00248, US Patent Application No.
09/209,542, filed on 11 December 1998, and its European equivalent European Patent
Application No. 98309600.9.

Although use of such coprocessors can considerably improve the efficiency of such
computation, the limitations of the microprocessor acting as CPU can still have a very
significant effect on overall system performance where such computations are
required. It would be desirable to improve a processor-coprocessor system still
further such that the limitations of the processor have a lesser effect on overall
performance.


SUMMARY OF INVENTION 

Accordingly, there is provided a computer system, comprising: a first processor; a
second processor for use as a coprocessor to the first processor; a memory; and a
decoupling element; wherein instructions are passed to the second processor from the
first processor through the decoupling element, such that the second processor
consumes instructions derived from the first processor through the decoupling
element, and wherein the second processor receives data from and writes data to the
memory, whereby the processing of instructions by the second processor is decoupled
from the operation of the first processor.

This arrangement can produce considerable improvements in performance, as the first
processor, typically a general purpose microprocessor, can switch tasks while
execution of the instructions is carried out on the second processor, typically a
processor specially adapted to carry out the computation or type of computation
delegated to it. This is very important when the first processor is the central
processing unit of a computer device, and thus may be required for a number of other
tasks. It is a particularly effective arrangement when the second processor is
configurable or reconfigurable.


The only task relating to the computation that may be left to the first processor is 
servicing of the decoupling element (so that it can provide instructions effectively). 
Advantageously, the decoupling element may be set up so that it will require no such 
servicing during performance of the delegated task. 


One possible choice of decoupling element is a coprocessor instruction queue,
wherein instructions are added to the coprocessor instruction queue by the first
processor and consumed from the coprocessor instruction queue by the coprocessor.
An alternative choice is a state machine, wherein information to provide instructions
is provided to the state machine by the first processor, and instructions are provided in
an ordered sequence to the second processor by the state machine. A further
alternative choice is a third processor, wherein information to provide instructions to
the second processor is provided to the third processor by the first processor, and
instructions are provided in an ordered sequence to the second processor by the third
processor.
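
The instruction queue variant of the decoupling element can be pictured as a bounded
producer/consumer queue. The following is a minimal software sketch only, not the
hardware of the invention: the function names, the instruction tuples, the None
end marker and the queue depth of 4 are all illustrative assumptions.

```python
import queue
import threading


def run_coprocessor(instr_queue, results):
    """Second processor: consume instructions in order until the end marker."""
    while True:
        instr = instr_queue.get()
        if instr is None:            # hypothetical end-of-task marker
            break
        results.append(instr)        # stand-in for real coprocessor work


def host_enqueue(instr_queue, instructions):
    """First processor: post instructions, then move on to other tasks."""
    for instr in instructions:
        instr_queue.put(instr)       # blocks only while the queue is full
    instr_queue.put(None)


def demo():
    instr_queue = queue.Queue(maxsize=4)   # assumed queue depth
    results = []
    worker = threading.Thread(target=run_coprocessor,
                              args=(instr_queue, results))
    worker.start()
    host_enqueue(instr_queue, [("run", 8), ("run", 16), ("run", 8)])
    worker.join()
    return results
```

In this sketch the first processor only blocks when the queue is full, which
corresponds to the servicing of the decoupling element mentioned above.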

An effective arrangement is for the system to include a coprocessor controller for
controlling the activity of the second processor and for synchronising the execution of
the coprocessor with loads from memory.

The system is particularly effective if it also includes a buffer memory from which the
second processor loads data and to which the second processor stores data, wherein
the buffer memory is adapted to load data from the memory and store data to the
memory. This has significant performance benefits for media algorithms in particular
if the memory is dynamic random access memory, and the buffer memory is adapted
to load data from, or store data to, the memory in bursts.

Decoupling of the first processor from the buffer memory can be achieved by use of a
second decoupling element, wherein memory instructions relating to movement of
data between the buffer memory and the memory are passed to the buffer memory
from the first processor through this second decoupling element, such that the buffer
memory consumes instructions derived from the first processor through the second
decoupling element. The processing of memory instructions by the buffer memory is
thus decoupled from the operation of the first processor.

Where such a buffer memory is used, and as the first processor is decoupled from the
other system elements, it is desirable for there to be a synchronisation mechanism to
synchronise transfer of data between the buffer memory and the memory with
execution of instructions by the second processor. Preferably, this is adapted to block
execution of instructions by the second processor on data which has not yet been
loaded to the buffer memory from the memory, and is adapted to block execution of
memory instructions for storage of data from the buffer memory to the memory where
relevant instructions have not yet been executed by the second processor. Greatest
efficiency is achieved if, when execution of instructions or memory instructions is
blocked by the synchronisation mechanism, other instructions or memory instructions
which are not blocked by the synchronisation mechanism may still be carried out.

In a further aspect, the invention provides a method of operating a computer system,
comprising: providing code for execution by a first processor; extracting from the
code a task to be carried out by a second processor acting as coprocessor to the first
processor; passing information defining the task from the first processor to a
decoupling element; passing instructions derived from said information from the
decoupling element to the second processor; and executing said instructions on the
second processor, wherein the processing of said instructions by the second processor
is decoupled from the operation of the first processor.


BRIEF DESCRIPTION OF FIGURES 

Specific embodiments of the invention will be described further below, by way of
example, with reference to the accompanying drawings, in which:

Figure 1 shows the basic elements of a system in accordance with a first embodiment
of the invention;

Figure 2 shows the architecture of a burst buffers structure used in the system of
Figure 1;

Figure 3 shows further features of the burst buffers structure of Figure 2;

Figure 4 shows the structure of a coprocessor controller used in the system of Figure 1
and its relationship to other system components;

Figure 5 shows an example to illustrate a computational model usable on the system
of Figure 1;

Figure 6 shows a timeline for computation and I/O operations for the example of
Figure 5;

Figure 7 shows an annotated graph provided as output from the frontend of a
toolchain useful to provide code for the system of Figure 1;

Figure 8 shows a coprocessor internal configuration derived from the specifications in
Figure 7;

Figure 9 shows the performance of alternative architectures for a 5x5 image
convolution using 32 bit pixels;

Figure 10 shows the performance of the alternative architectures used to produce
Figure 9 for a 5x5 image convolution using 8 bit pixels;

Figures 11A and 11B show alternative pipeline architectures employing further
embodiments of the present invention;

Figure 12 shows two auxiliary processors usable as an alternative to the coprocessor
instruction queue and the burst instruction queue in the architecture of Figure 1; and

Figure 13 shows implementation of a state machine as an alternative to the
coprocessor instruction queue in the architecture of Figure 1.

DESCRIPTION OF SPECIFIC EMBODIMENTS

Figure 1 shows the basic elements of a system in accordance with a first embodiment
of the invention. Essentially, the system comprises a processor 1 and a coprocessor 2,
established so that a calculation can be partitioned between the processor 1 and the
coprocessor 2 for greatest computational efficiency. The processor 1 may be
essentially any general purpose processor (for example, an i960) and the coprocessor
2 essentially any coprocessor capable of handling with significantly greater
effectiveness a part of the calculation. In the specific system described here,
essentially the whole computation is to be handled by the coprocessor 2, rather than
by the processor 1 - however, the invention is not limited to this specific arrangement.

In the system specifically described, coprocessor 2 is a form of reconfigurable FPGA, 
as will be discussed further below - however, other forms of coprocessor 2, such as, 
for example, ASICs and DSPs, could be employed instead (with corresponding
modifications to the computational model required). Both the processor 1 and 
coprocessor 2 have access to a DRAM main memory 3, though the processor 1 also 
has access to a cache of faster access memory 4, typically SRAM. Efficient access to 
the DRAM 3 is provided by "burst buffer" memory 5 adapted to communicate with 
DRAM for the efficient loading and storing of "bursts" of information - burst buffers 
will be described further below. Instructions to the burst buffers 5 are provided 
through a burst instruction queue 6, and the burst buffers 5 operate under the control 
of a burst buffer controller 7. The architecture of the burst buffers is mirrored, for 
reasons discussed below, in the architecture associated with the coprocessor 2. 
Instructions to the coprocessor 2 are provided in a coprocessor instruction queue 8, 
and the coprocessor operates under the control of a coprocessor controller 9. 
Synchronisation of the operation of the burst buffers and the coprocessor and their 
associated instruction queues is achieved by a specific mechanism, rather than in a 
general manner by processor 1 itself. In this embodiment, the mechanism comprises 
the load/execute semaphore 10 and the execute/store semaphore 11, operating in a 
manner which will be described below (other such synchronisation mechanisms are 
possible, as will also be discussed). 

Description of Elements in System Architecture 

The individual elements of the system will now be discussed in more detail. The
processor 1 generally controls the computation, but in such a manner that some (or, in
the embodiment described, all) of the steps in the computation itself are carried out in
the coprocessor 2. The processor 1 provides, through the burst instruction queue 6,
instructions for particular tasks: configuration of the burst buffer controller 7; and
transfer of data between the burst buffer memory 5 and the main memory 3.
Furthermore, through the coprocessor instruction queue 8, the processor 1 also
provides instructions for further tasks: configuration of the coprocessor controller 9;
and initiation of a computation on coprocessor 2. This computation run on
coprocessor 2 accesses data through the burst buffer memory 5.

The use of the coprocessor instruction queue 8 effectively decouples the processor 1
from the operation of coprocessor 2, and the use of the burst instruction queue 6
effectively decouples the processor 1 from the burst buffers 5. The detail of this
arrangement, and of the decoupling it provides, is discussed further below in the
context of the computational model for this embodiment of the invention.



The coprocessor 2 performs some or all of the actual computation. A particularly
suitable coprocessor is the CHESS FPGA structure, described in International Patent
Application No. GB98/00262, International Patent Application No. GB98/00248, US
Patent Application No. 09/209,542, filed on 11 December 1998, and its European
equivalent European Patent Application No. 98309600.9, the contents of which
applications are incorporated by reference herein to the extent permissible by law.
This coprocessor is reconfigurable, and comprises a checkerboard array of 4-bit ALUs
and switching structures, whereby the coprocessor is configurable such that an output
from one 4-bit ALU can be used to instruct another ALU. The CHESS architecture is
particularly effective for pipelined calculations, and is effectively adapted here to
interact with input and output data streams. The coprocessor controller 9 (whose
operation will be discussed further below) receives high-level control instructions
(instructions for overall control of the coprocessor 2, rather than instructions relating
to detail of the calculation - e.g. "run for n cycles") from the coprocessor instruction
queue 8. The CHESS coprocessor 2 runs under the control of the coprocessor
controller 9 and receives and stores data through interaction with the burst buffers 5.
The CHESS coprocessor 2 thus acts on input streams to produce an output stream.
This can be an efficient process because the operation of the CHESS coprocessor is
highly predictable. The detailed operation of computation according to this model is
discussed at a later point.

The processor 1 has access to a fast access memory cache 4 in SRAM in a
conventional manner, but the main memory is provided as DRAM 3. Effective access
to the DRAM is provided by burst buffers 5. Burst buffers have been described in
European Patent Application No. 97309514.4 and corresponding US Patent
Application Serial No. 09/3,526, filed on 6 January 1998, which applications are
incorporated by reference herein to the extent permissible by law. The burst buffer
architecture is described briefly herein, but for full details of this architecture the
reader is referred to these earlier applications.

The burst buffer architecture is useful, but not fundamental, to the operation of the
present invention as described in these embodiments. In the context of the present
invention, the most significant aspect of the burst buffers architecture is that the burst
buffers 5 operate according to instructions from the processor 1, and that these
instructions are provided by means of a queue (or alternative, as discussed below).
This mechanism allows for the possibility of decoupling of the processor 1 from
operation of the burst buffers 5 in an appropriate architecture.

The elements of the version of the burst buffers architecture (variants are available, as
is discussed in the aforementioned application) used in this embodiment are shown in
Figures 2 and 3. A connection 12 for allowing the burst buffers components to
communicate with the processor 1 is provided. Memory bus 16 provides a connection
to the main memory 3 (not shown in Figure 2). This memory bus may be shared with
cache 4, in which case memory datapath arbiter 58 is adapted to allow communication
to and from cache 4 also.



The overall role of burst buffers in this arrangement is to allow computations to be
performed on coprocessor 2 involving transfer of data between this coprocessor 2 and
main memory 3 in a way that both maximises the efficiency of each system
component and at the same time maximises the overall system efficiency. This is
achieved by a combination of several techniques:

burst accesses to DRAM, using the burst buffers 5 as described below;

simultaneous execution of computation on coprocessor 2 and data transfers
between main memory 3 and burst buffer memory 5, using a technique called
"double buffering"; and

decoupling the execution of processor 1 from the execution of coprocessor 2
and burst buffer memory 5 through use of the instruction queues.



"Double buffering" is a technique known in, for example, computer graphics. In the
form used here it involves consuming - reading - data from one part of the burst
buffer memory 5, while producing - writing - other data into a different region of the
same memory, with a switching mechanism to allow a region earlier written to now to
be read from, and vice-versa.
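
The switching of roles between two regions can be sketched as follows. This is an
illustrative Python model only: the function name and the use of summation as the
"computation" are assumptions, and the load and the computation run one after the
other here, whereas in the system described they would proceed at the same time.

```python
def double_buffered_sum(blocks):
    """Sum each block while the next block is placed in the other region."""
    regions = [None, None]          # two regions of one buffer memory
    write_idx = 0
    totals = []
    if blocks:
        regions[write_idx] = list(blocks[0])       # initial load
    for k in range(len(blocks)):
        read_idx = write_idx        # the region just written becomes readable
        write_idx = 1 - write_idx   # and the other region becomes writable
        if k + 1 < len(blocks):
            regions[write_idx] = list(blocks[k + 1])   # producer: next load
        totals.append(sum(regions[read_idx]))          # consumer: compute
    return totals
```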

A particular benefit of burst buffers is effective utilisation of a feature of conventional
DRAM construction. A DRAM comprises an array of memory locations in a square
matrix. To access an element in the array, a row must first be selected (or 'opened'),
followed by selection of the appropriate column. However, once a row has been
selected, successive accesses to columns in that row may be performed by just
providing the column address. The concept of opening a row and performing a
sequence of accesses local to that row is called a "burst". When data is arranged in a
regular way, such as in media-intensive computations (typically involving an
algorithm employing a regular program loop which accesses long arrays without any
data dependent addressing), then effective use of bursts can dramatically increase
computational speed. Burst buffers are new memory structures adapted to access data
from DRAM through efficient use of bursts.
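
The benefit of keeping accesses local to an open row can be illustrated by counting
row openings for an access sequence. The sketch below is a simplified model under
stated assumptions: a single bank with one open row at a time, and a row size of
2048 bytes chosen purely for illustration.

```python
def count_row_openings(addresses, row_bytes=2048):
    """Count how often a new DRAM row must be opened for an access sequence."""
    openings = 0
    open_row = None
    for addr in addresses:
        row = addr // row_bytes      # row holding this address (assumed layout)
        if row != open_row:          # a different row must be opened first
            openings += 1
            open_row = row
    return openings
```

Under this model, 1024 consecutive word accesses starting at address 0 open only
two rows, whereas alternating between two distant streams opens a row on every
access.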

A system may contain several burst buffers. Typically, each burst buffer is allocated
to a respective data stream. Since algorithms have a varying number of data streams, a
fixed amount of SRAM 26 is available to the burst buffers as a burst buffer memory
area and this amount is divided up according to the number of buffers required. For
example, if the amount of fixed SRAM is 2 Kbytes, and if an algorithm has four data
streams, the memory region might be partitioned into four 512-byte burst buffers.
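
The partitioning in this example can be written out as a short calculation. The
sketch assumes an equal split between streams, which is only one possible policy;
the function name is illustrative.

```python
def partition_buffer_area(total_bytes, n_streams):
    """Split a fixed buffer area into equal (offset, size) burst buffers."""
    if n_streams <= 0:
        raise ValueError("need at least one data stream")
    size = total_bytes // n_streams
    return [(i * size, size) for i in range(n_streams)]
```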

In architectures of this type, a burst comprises the set of addresses defined by:

\[ \text{burst} = \{\, B + S \times i \mid B, S, i \in \mathbb{N} \wedge 0 \le i < L \,\} \]

where B is the base address of the transfer, S is the stride between elements, L is the
length and N is the set of natural numbers. Although not explicitly defined in this
equation, the burst order is defined by i incrementing from 0 to L-1. Thus, a burst may
be defined by the 3-tuple of:

(base_address, length, stride)

In software, a burst may also be defined by the element size. This implies that a burst
may be sized in bytes, halfwords or words. The units of stride must take this into
account. A "sized-burst" is defined by a 4-tuple of the form:

(base_address, length, stride, size)

A "channel-burst" is a sized-burst where the size is the width of the channel to
memory. The compiler is responsible for the mapping of software sized-bursts into
channel-bursts. The channel-burst may be defined by the 4-tuple:

(base_address, length, stride, width)

If the channel width is 32 bits (or 4 bytes), the channel-burst is always of the form:

(base_address, length, stride, 4)

or abbreviated to the 3-tuple (base_address, length, stride).
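
The address set above can be enumerated directly from the 3-tuple. The following
Python sketch is illustrative only; the function name is an assumption and the
addresses are produced in burst order, with i running from 0 to L-1.

```python
def burst_addresses(base_address, length, stride):
    """Return the addresses B + S*i for i = 0 .. L-1, in burst order."""
    return [base_address + stride * i for i in range(length)]
```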

The control of this memory and the allocation (and freeing) of burst buffers is handled
at a higher level by a software process. In the present embodiment, "double
buffering" is used, but other strategies are certainly possible - the decision involves a
trade-off between storage efficiency and simplicity. The burst buffer memory area 26
loads data from and stores data to the main memory 3 through memory datapath
arbiter 58, which operates under control of DMA controller 56, responsive to
instructions received through the burst instruction queue 6. Data is exchanged
between the burst buffer memory area 26 and the processor 1 or the coprocessor 2
through the connection means 12. As shown in Figure 3, the control interface for the
burst buffers system 5 is based around a pair of tables: a Memory Access Table
(MAT) 65 describing regions of main memory for bursting to and from the burst
buffer memory, and a Buffer Access Table (BAT) 66 describing regions of burst
buffer memory. In this embodiment, a homogeneous area of dual-port SRAM is used
for the burst buffer memory area 26.

A burst buffers arrangement which did not employ MATs and BATs (such as is also
described in European Patent Application No. 97309514.4) could be used in
alternative embodiments of the present invention - the parameters implicitly encoded
in MATs and BATs (source address, destination address, length, stride) would then
have to be explicitly specified for every burst transfer issued. The main reason to use
MATs and BATs, rather than straightforward addresses, lengths and strides, is that
this significantly reduces the overall code size. In the context of the present
invention, this is typically useful, rather than critical.

Burst instructions originating from the processor 1 are provided to the burst buffers 5
by means of a burst instruction queue 6. Instructions from the burst instruction queue
6 are processed by a buffer control element 54 to reference slots in the MAT 65 and
the BAT 66. The buffer controller also receives control inputs from eight burst
control registers 52. Information contained in these two tables is bound together at
run time to describe a complete main-memory-to-burst-buffer transaction. Outputs
are provided from the buffer controller 54 to direct memory access (DMA) controller
56 and hence to the memory datapath arbiter 58 to effect transactions between the
main memory 3 and the burst buffers memory area 26.

The key burst instructions are those used to load data from main memory 3 to the
burst buffer memory area 26, and to store data from the burst buffer memory area 26
to the main memory 3. These instructions are "loadburst" and "storeburst". The
loadburst instruction causes a burst of data words to be transferred from a determined
location in the memory 3 to one of the burst buffers. There is also a
corresponding storeburst instruction, which causes a burst of data words to be
transferred from one of the burst buffers to the memory 3, beginning at a specific
address in the memory 3. For the architecture of Figure 1, additional synchronisation
instructions are also required - these are discussed further below.



The instructions loadburst and storeburst differ from normal load and store
instructions in that they complete in a single cycle, even though the transfer has not
occurred. In essence, the loadburst and storeburst instructions tell the memory
interface 16 to perform the burst, but they do not wait for the burst to complete.

The fundamental operation is to issue an instruction which indexes to two table 
entries, one in each of the memory access and buffer access tables. The index to the 
memory access table retrieves the base address, extent and stride used at the memory 
end of the transfer. The index to the buffer access table retrieves the base address 
within the burst buffer memory region. In the embodiment shown, masking and 
offsets are provided to the index values by a context table (this is discussed further in
European Patent Application No. 97309514.4), although it is possible to use actual 
addresses instead. The direct memory access (DMA) controller 56 is passed the 
parameters from the two tables and uses them to specify the required transfer. 
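
The binding of one entry from each table into a single transfer description can be
sketched as below. This is a simplified illustration rather than the hardware
behaviour: the class and function names are assumptions, the context table masking
and offsets are omitted, and the tables are modelled as plain Python lists.

```python
from dataclasses import dataclass


@dataclass
class MatEntry:
    memaddr: int   # base address in main memory
    extent: int    # extent of the transfer
    stride: int    # interval between successive elements


@dataclass
class BatEntry:
    bufaddr: int   # base address within the burst buffer memory region


def bind_transfer(mat, bat, mat_index, bat_index):
    """Combine one MAT entry and one BAT entry into DMA parameters."""
    m = mat[mat_index]
    b = bat[bat_index]
    return {
        "memaddr": m.memaddr,
        "extent": m.extent,
        "stride": m.stride,
        "bufaddr": b.bufaddr,
    }
```

A burst instruction then only needs to carry the two small indices, which is
consistent with the code size argument made above for using the tables.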

Table 1 shows a possible instruction set.

Instruction        Parameters                    Description
-----------------  ----------------------------  -----------------------------------
BB_LOADBURST       mat_index (integer),          Load a burst of data into the burst
                   bat_index (integer),          buffer memory from main memory, and
                   block_increment (boolean)     optionally increment the base
                                                 address in main memory

BB_STOREBURST      mat_index (integer),          Store a burst of data into main
                   bat_index (integer),          memory from the burst buffer
                   block_increment (boolean)     memory, and optionally increment
                                                 the base address in main memory

BB_LX_INCREMENT    N/A                           Increment the value of the LX
                                                 semaphore

BB_XS_DECREMENT    N/A                           Decrement the value of the XS
                                                 semaphore

BB_SET_MAT         entry (integer),              Set a MAT entry to the desired
                   memaddr (integer),            values
                   extent (integer),
                   stride (integer)

BB_SET_BAT         entry (integer),              Set a BAT entry to the desired
                   bufaddr (integer),            values
                   extent (integer)

Table 1: Instruction set for burst buffers

The storeburst instruction (BB_STOREBURST) indexes parameters in the MAT and
BAT, which define the characteristics of the requested transfer. If the
block_increment bit is set, the memaddr field of the indexed entry in the MAT is
automatically updated when the transfer completes (as is discussed below).

The loadburst instruction (BB_LOADBURST) also indexes parameters in the MAT
and BAT, again which define the characteristics of the required transfer. As before, if
the block_increment bit is set, the memaddr field of the indexed entry in the MAT is
automatically updated when the transfer completes.

The synchronisation instructions needed are provided as Load-Execute Increment and
Execute-Store Decrement (BB_LX_INCREMENT and BB_XS_DECREMENT). The
purpose of BB_LX_INCREMENT is to make sure that the execution of coprocessor 2
on a burst of data takes place only after that data has been loaded into the burst
buffer memory 5. The purpose of BB_XS_DECREMENT is to make sure that the
execution of a storeburst instruction takes place only after the coprocessor 2 has
computed the results that are to be stored back into main memory 3.

In this embodiment, the specific mechanism upon which these instructions act is a set
of two counters that track, respectively:

the number of regions in burst buffer memory 5 ready to receive a storeburst;
and

the number of completed loadburst instructions.

Requests for data by the coprocessor 2 are performed by decrementing the LX
counter, whereas the availability of data is signalled by incrementing the XS counter.
These counters have to satisfy two purposes: they must be accessible to only one
system component at any given time, and they must have the ability to suspend the
process that requests unavailable data.

The existing concept that matches most closely what is required is the semaphore as
described by Dijkstra (Dijkstra, "Cooperating Sequential Processes", Academic
Press, 1968), to which the counters used here are analogous. Executing a Wait()
instruction on a semaphore decrements its value, whereas executing a Signal()
instruction increments it. Executing a Wait() on a semaphore whose value is already
0 stops the software process or hardware component which is trying to execute the
Wait() until the value of the semaphore is increased.

In the present embodiment, the BB_XS_DECREMENT instruction would act like a
Wait() on the XS semaphore (11 in Figure 1) whereas the BB_LX_INCREMENT
instruction would act like a Signal() on the LX semaphore (10 in Figure 1). As
will be described later, the coprocessor controller 9 would, conversely, perform a
Wait() on the LX semaphore 10 and a Signal() on the XS semaphore 11. The
semantics of these instructions can be the same as described in Dijkstra's paper,
although the overall arrangement of Signal() and Wait() operations differs
significantly from that described in the original paper. These instructions would be
issued in the appropriate sequence (as is discussed further below) in order to make
sure that the relative temporal ordering of certain events, necessary for the correctness
of the system, is respected.
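
The pairing of Wait() and Signal() operations on the two semaphores can be sketched
with software threads. The model below is illustrative only and makes several
assumptions: both semaphores start at zero, all loads are issued before any store,
summation stands in for the coprocessor computation, and the function and variable
names are invented for the example.

```python
import threading


def run_pipeline(blocks):
    lx = threading.Semaphore(0)    # load/execute semaphore (10 in Figure 1)
    xs = threading.Semaphore(0)    # execute/store semaphore (11 in Figure 1)
    buffer_in, buffer_out, memory_out = {}, {}, []

    def burst_side():
        for k, block in enumerate(blocks):
            buffer_in[k] = list(block)         # stands in for a loadburst
            lx.release()                       # BB_LX_INCREMENT: Signal on LX
        for k in range(len(blocks)):
            xs.acquire()                       # BB_XS_DECREMENT: Wait on XS
            memory_out.append(buffer_out[k])   # stands in for a storeburst

    def coprocessor_side():
        for k in range(len(blocks)):
            lx.acquire()                       # Wait on LX: data must be loaded
            buffer_out[k] = sum(buffer_in[k])  # stand-in computation
            xs.release()                       # Signal on XS: result is ready

    threads = [threading.Thread(target=burst_side),
               threading.Thread(target=coprocessor_side)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return memory_out
```

Neither side polls the other: the coprocessor side blocks on LX until a load has
completed, and the store side blocks on XS until the matching result exists.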

Memory Access Table (MAT) 65 will now be described with reference to Figure 3.
This is a memory descriptor table holding information relating to main memory
locations involved in burst transactions. Each entry in the MAT is an indexed slot
describing a transaction to main memory. In this embodiment, the MAT 65 comprises
16 entries, though different implementations are of course possible. Each entry
comprises three fields:

1. Memory address (memaddr) - the start address of the relevant region in
main memory. Ideally, this location is in physical memory space, as
virtual address translation may result in a burst request spanning two
physical pages, which would cause difficulties for the memory
controller.

2. Extent (extent) - the extent of the transfer. This is the length of the
transfer, multiplied by the stride, and gives the last address transferred
plus one. The length of the transfer is calculated by the division of the
extent by the stride, and this is automatically copied to the bufsize field
of the related BAT 66 (see below) after a transfer has completed.

3. Stride (stride) - the interval between successive elements in a transfer.

memaddr: This is the 32 bit unsigned, word-aligned address of the first element of
the channel burst.

extent: The parameter in the extent register is the address offset covering the range
of the burst transfer. If the transfer requires L elements separated by a stride of
S, then the extent is S*L.

stride: The parameter stride is the number of bytes skipped between
accesses. Values of the transfer stride interval are restricted to the range of 1 to 1024.
Values greater than 1024 are automatically truncated to 1024. Reads of this register
return the value used for the burst (i.e. if truncation was necessary, then the truncated
value is returned). Also, strides must be multiples of the memory bus width, which in
this case is 4 bytes. Automatic truncation (without rounding) is performed to enforce
this alignment.
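
The two stride rules above, the upper limit of 1024 and the alignment to the 4-byte
bus width, can be sketched as a small function. This is one literal reading of the
text and not a statement of the hardware behaviour: the function name and the order
of the two steps are assumptions, and the treatment of strides below the bus width
is not stated in the text (under this reading they truncate to 0).

```python
def normalise_stride(stride, max_stride=1024, bus_bytes=4):
    """Apply the stride limit, then truncate down to a bus-width multiple."""
    limited = min(stride, max_stride)           # values above 1024 become 1024
    return limited - (limited % bus_bytes)      # truncation, without rounding
```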

20 An example of values contained by a MAT slot might be: 
{Oxlfeelbad, 128, 16} 

which results in a 32 word (32 4 byte words) burst, with each word separated 
25 by 4 words (4 4 byte words). 

The auto-increment indicator bit of a burst instruction also has relevance to the MAT
65. If this bit is set in the burst instruction, the start address entry is increased to
point to the next memory location should the burst have continued past 32. This
saves processor overhead in calculating the start address for the next burst in a long
sequence of memory accesses.

The buffer access table (BAT) 66 will now be described with reference to Figure 3.
This is again a memory descriptor table, in this case holding information relating to
the burst buffer memory area 26. Each entry in the BAT 66 describes a transaction to
the burst buffer memory area 26. As for the MAT 65, the BAT 66 comprises 16
entries, though this can of course be varied as for the MAT 65. Each entry in this case
comprises two fields:

1. Buffer address (bufaddr) - the start of the buffer in the buffer area

2. Buffer size (bufsize) - the size of the buffer area used at the last transfer

The buffer address parameter bufaddr is the offset address for the first element of
the channel-burst in the buffer area. The burst buffer area is physically mapped by
hardware into a region of the processor's memory space. This means that the
processor must use absolute addresses when accessing the burst buffer area.
However, DMA transfers simply use the offset, so it is necessary for hardware to
manage any address resolution required. Illegally aligned values may be
automatically aligned by truncation. Reads of this register return the value used for
the burst (i.e. if truncation was necessary, then the truncated value is returned). The
default value is 0.

The parameter bufsize is the size of the region within the buffer area occupied by
the most recent burst. This register is automatically set on the completion of a burst
transfer which targeted its entry. Note that the value stored is the burst length, since a
value of 0 indicates an unused buffer entry. This register may be written, but this is
only useful after a context switch when buffers are saved and restored. The default
value is again 0.

Programming MAT and BAT entries is performed through the use of BB_SET_MAT
and BB_SET_BAT instructions. The entry parameter determines the entry in the
MAT (or BAT) to which the current instruction refers.

Further details of the burst buffer architecture and the mechanisms for its control are
provided in European Patent Application No. 97309514.4 and the corresponding US
Patent Application Serial No. 09/3,526. The details provided above are primarily
intended to show the architectural elements of the burst buffer system, and to show
the functional effect that the burst buffer system can accomplish, together with the
inputs and outputs that it provides. The burst buffer system is optimally adapted for
a particular computational model, which is developed here into a computational
model for the described embodiment of the present invention. This computational
model is described further below.

The burst instruction queue 6 has been described above. A significant aspect of the
embodiment is that instructions are similarly provided to the coprocessor through a
coprocessor instruction queue 8. The coprocessor instruction queue 8 operates in
connection with the coprocessor controller 9, which determines how the coprocessor
receives instructions from the processor 1 and how it exchanges data with the burst
buffer system 5.

Use of the coprocessor instruction queue 8 has the important effect that the processor
1 is decoupled from the calculation itself. During the calculation, processor
resources are thus available for the execution of other tasks. The only
situation which could lead to operation of processor 1 being stalled is that one of the
instruction queues 6, 8 is full of instructions. This case can arise when processor 1
produces instructions for either queue at a rate faster than that at which instructions
are consumed. Solutions to this problem are available. Effectiveness can be improved
by requiring the processor 1 to perform a context switch and return to service the
queues after a predefined amount of time, or upon receipt of an interrupt triggered by
the fact that the number of slots occupied in either queue has decreased to a
predefined amount. Conversely, if one of the two queues becomes empty because the
processor 1 cannot keep up with the rate at which instructions are consumed, the
consumer of those instructions (the coprocessor controller 9 or the burst buffer
controller 7) will stall until new instructions are produced by the processor 1.
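
The full and empty conditions described above can be modelled with a small bounded queue. The sketch below is illustrative only: the queue depth, the type name and the function names are assumptions, and the text does not specify how either instruction queue is actually implemented.

```c
#include <stdint.h>

#define QUEUE_SLOTS 8  /* illustrative depth; the text does not fix one */

typedef struct {
    uint32_t slot[QUEUE_SLOTS];
    unsigned head;   /* index of the next slot to consume */
    unsigned count;  /* number of occupied slots */
} instr_queue;

/* Producer side (processor 1). Returns 0 when the queue is full, which is
 * the case in which the processor would stall or switch to another task. */
int queue_push(instr_queue *q, uint32_t instr)
{
    if (q->count == QUEUE_SLOTS)
        return 0;
    q->slot[(q->head + q->count) % QUEUE_SLOTS] = instr;
    q->count++;
    return 1;
}

/* Consumer side (a controller). Returns 0 when the queue is empty, which is
 * the case in which the controller stalls until new instructions arrive. */
int queue_pop(instr_queue *q, uint32_t *instr)
{
    if (q->count == 0)
        return 0;
    *instr = q->slot[q->head];
    q->head = (q->head + 1) % QUEUE_SLOTS;
    q->count--;
    return 1;
}
```

A producer that sees queue_push return 0 could either spin or yield and retry later, corresponding to the context-switch option mentioned in the text.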

Modifications can also be provided to the architecture which ensure that no further
final part of this specification. 

The basic functions of the coprocessor controller 9 are to fetch data from the burst
buffer memory 5 to the coprocessor 2 (and vice versa), to control the activity of the
coprocessor, and to synchronise the execution of the coprocessor 2 with the
appropriate loads from, or stores to, the burst buffer memory 5. To achieve these
functions, the coprocessor controller may be in essence a relatively simple state
machine able to generate addresses according to certain rules.

Figure 4 shows the coprocessor controller 9 in its relationship to the other components
of the architecture, and also shows its constituent elements and its connections with
other elements in the overall architecture. Its exact function depends on the type of
inputs and outputs of the coprocessor used, and so may vary in detail from that
described below. In the case of a CHESS coprocessor, these inputs and outputs are
input and output data streams exchanged with the burst buffer memory 5.

Coprocessor controller 9 performs two main tasks:

control of the communication between the coprocessor 2 and the burst buffer
memory 5; and

maintenance of the system state through the use of a control finite state machine 42.

The coprocessor 2 accesses data in streams, each of which is given an association with
one of a number of control registers 41. Addresses for these registers 41 are generated
in a periodic fashion by control finite state machine 42 with addressing logic 43,
according to a sequence generated by the finite state machine 42.

At every tick of a clock within the finite state machine 42, the finite state machine
selects at most one register 41, and the address it holds is used to allow the register 41
to access the burst buffer memory 5. At the same time, an appropriate control signal is
generated by the finite state machine 42 and sent to a multiplexer so that the
appropriate address is sent to the burst buffer memory 5, together with the correct
read/write signal. A read/write signal is associated with each register 41, with a
value which does not change throughout the whole computation.

After an address obtained from a register 41 has been used to address the burst buffer
memory 5, a constant quantity is added to its value, generally the same as the width
of the connection between the coprocessor 2 and the burst buffer memory 5. That is,
if the width of this connection is 4 bytes, then the increment made to counter 41 will
be 4. This is essentially comparable to "stride" in the programming of burst buffers.

The coprocessor controller mechanism described above allows the multiplexing of
several data streams along a single bus. Each of the data streams can be considered
to access the single shared bus through its own port.

For this system to operate such that the integrity of communication is ensured, it is
necessary that at the other end of the bus the coprocessor 2 is ready to read from and
write to this bus in a synchronous manner. It is the responsibility of the application
software (and, specifically, of the part of the application software that configures
coprocessor 2) to ensure that:

no two streams try to access the bus at the same time; and that

the execution of coprocessor 2 is synchronous with the data transfer to and from
burst buffer memory 5.

This latter requirement ensures that the coprocessor 2 is ready to read the data placed
by burst buffers memory 5 on the connection between the two devices, and vice
versa.



Although more than one physical line could usefully be provided between the CHESS
array 2 and the burst buffer memory 5, the general need for multiplexing would still
remain. Unless the number of physical connections between the coprocessor 2 and
the burst buffer memory 5 is greater than or equal to the total number of logical I/O
streams for the coprocessor 2, it will always be true that two or more logical streams
have to be multiplexed on the same wire. Technological reasons related to the design
of fast SRAM (as is advantageously used for the burst buffer memory 5) discourage
the use of more than one connection with the coprocessor 2.

The coprocessor controller 9 also acts to control the execution of the CHESS array
comprising coprocessor 2 so that it will run for a specified number of cycles before
"freezing" the CHESS array by "gating" (that is, stopping) its internal clock in a way
that does not affect the internal state of the pipelines in the coprocessor 2. This
number of ticks is specified using the CC_START_EXEC instruction, described below.

Coprocessor controller 9 is programmed by processor 1 through the use of the
coprocessor instruction queue 8. A possible instruction set for this coprocessor
controller 9 is shown in Table 2 below.



Opcode                Parameter Value      Comment

CC_CURRENT_PORT       n (integer)          Port # the next CC_PORT_xxx commands will refer to
CC_PORT_PERIOD        (integer)            Period of activity of a port
CC_PORT_PHASE_START   start (integer)      Phase start of the activity of a port
CC_PORT_PHASE_END     end (integer)        Phase end of the activity of a port
CC_PORT_TIME_START    tstart (integer)     Start cycle of the activity of a port
CC_PORT_TIME_END      tend (integer)       End cycle of the activity of a port
CC_PORT_ADDRESS       addr (integer)       Initial address for a port
CC_PORT_INCREMENT     addr_incr (integer)  Address increment for a port
CC_PORT_IS_WRITE      rw (boolean)         Data transfer direction for a port
CC_START_EXEC         ncycles (integer)    Start/Resume the execution of coprocessor 2 for a determined # of cycles
CC_LX_DECREMENT       N/A                  Decrement the value of the LX semaphore
CC_XS_INCREMENT       N/A                  Increment the value of the XS semaphore

Table 2: Coprocessor controller instruction set
For the aforementioned instructions, different choices of instruction format could be
made. In one such format, part of the instruction word encodes the opcode
and 16 bits encode the optional parameter value described above.
The semantics of individual instructions are as follows: 

CC_CURRENT_PORT selects one of the ports as the recipient of all the
following CC_PORT_xxx instructions, until the next CC_CURRENT_PORT;

CC_PORT_PERIOD sets the period of activation of the current port to the
value of the integer parameter;

CC_PORT_PHASE_START/CC_PORT_PHASE_END (start, end) set the
start/end of the activation phase of the current port to the value of the integer
parameter (start, end);

CC_PORT_TIME_START/CC_PORT_TIME_END (tstart, tend) set the first/last
cycle of activity of the current port;

CC_PORT_ADDRESS (addr) sets the current address of the current port to the
value of the integer parameter addr;

CC_PORT_INCREMENT (addr_incr) sets the address increment of the current port
to the value of the integer parameter addr_incr;

CC_PORT_IS_WRITE (rw) sets the data transfer direction for the current port to
the value of the Boolean parameter rw;

CC_START_EXEC (ncycles) initiates the execution of coprocessor 2 for a
number of clock cycles specified by the associated integer parameter ncycles;

CC_LXS_DECREMENT decrements (in a suspensive manner, as previously
described) the value of the LX semaphore;

CC_XSS_INCREMENT increments the value of the XS semaphore.
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A port - **- - ^ cunall v a,»e of counter * W ^ foI iostan ce 

• *n algorithm using this at .Ration of the 

* * ~ ^ borst buffer confer 7, 
„ coprocessor controller 9 an 

^uUonofmealBonH- „ 

. ..f^coprocessor^H-^; oeaIK sp eci6c <o *- 
* <• «h» coprocessor conuu 

, Pormepro^^* 6 lo fcetotal 

>f ^r controller 9 is configured 

D number, v ^ descrw , nrm the desired 

me programming oi 
^is^^' 






^nfieuration. Aimou^ n the 

address configur stteam w u 

^ var.ab.Uty fotwar d manner w»» staB0 , 

do „bie-buffenn 6 >» » double -buffennS, as 

the impression of accessing continuous streams, whereas in fact buffers are being
switched continuously.

Commands have to be sent to the burst instruction queue 6 in order to configure the
transfers of data to and from main memory 3 into the burst buffer memory 5. These
instructions (BB_SET_MAT and BB_SET_BAT) configure the appropriate entries
in the BAT and the MAT, in a manner consistent with the programming of the
coprocessor controller 9. In this embodiment the instructions to program the MAT
and the BAT entries are issued through the burst instruction queue 6. An alternative
possibility would be the use of memory-mapped registers which the processor 1
would write to and read from. As in the present embodiment there is no possibility of
querying the state of the burst buffer controller 7, the burst instruction queue 6 is
used for this purpose, without the supervision of the processor 1.
After these steps have been performed, the actual execution of the CHESS array can

----- — 

„ Iteclolcyc^erthisvaluehasbeen— <^^£ 
CHESS array of coprocessor 2, and enables tne exc 
A „ important step mus, however be added before ins— 

30 ^utatio ^-^:^z^z:Zw** 

necessary synchronisation mechanisms are m place to imp 

The coprocessor controller 9 will try to decrement the value of the LX semaphore
described above. The initial value of this semaphore is 0, so the coprocessor
controller 9 and the coprocessor 2 are stalled. Only once the value of the LX
semaphore has been incremented by the burst buffer controller 7 after a successful
loadburst instruction will the coprocessor 2 be able to start its execution. To achieve
this effect, a CC_LX_DECREMENT instruction is inserted in the coprocessor
instruction queue 8 before the "start coprocessor 2 execution" (CC_START_EXEC)
instruction. As will be shown, a corresponding "increment the LX semaphore"
(BB_LX_INCREMENT) instruction will be inserted in the burst instruction queue 6
after the corresponding loadburst instruction.
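
The handshake between the two controllers can be modelled with a plain counter. This is a minimal sketch, assuming that the semaphore starts at 0 and that the "suspensive" decrement simply cannot proceed while the value is 0. The type and function names are hypothetical, and the blocking decrement is replaced by a non-blocking attempt so that the model runs in a single thread.

```c
/* Counting semaphore modelled as a counter; names are illustrative only. */
typedef struct {
    unsigned value;
} sem_model;

/* Stand-in for the suspensive decrement issued by the coprocessor
 * controller: returns 1 and decrements if the value is positive, and
 * returns 0 (the caller must keep waiting) if the value is 0. */
int sem_try_decrement(sem_model *s)
{
    if (s->value == 0)
        return 0;
    s->value--;
    return 1;
}

/* Increment issued by the burst buffer controller after a loadburst. */
void sem_increment(sem_model *s)
{
    s->value++;
}
```

In this model the coprocessor side would retry sem_try_decrement until it succeeds before acting on the following CC_START_EXEC instruction.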

The actual transfer of data between CHESS logical streams and the burst buffer
memory 5 is carried out in accordance with the programming of the coprocessor
controller 9 as previously described.

The number of ticks for which the counter 42 has to run depends on how long it takes
to consume a buffer. Once a buffer has been consumed, the execution of coprocessor 2
will stop. The next instruction in the coprocessor instruction queue 8 must be a
synchronisation instruction (that is, a CC_LX_DECREMENT), in order to ensure that
the next burst of data has arrived in the burst buffers memory 5. Following this
instruction, the new burst of data is assigned to the data stream (with a
CC_PORT_ADDRESS instruction), and execution is restarted with a new
CC_START_EXEC instruction.






Computational Model

An illustration of the overall computation model will now be described with
reference to Figure 5. The illustration indicates how an algorithm can be adapted for
use in this architecture, using as an example a simple vector addition, which can be
coded in C for a conventional microprocessor as:



    int a[1024], b[1024], c[1024];
    int i;
    for (i = 0; i < 1024; i++)
        a[i] = b[i] + c[i];

A piece of C code to run on processor 1 which achieves on the architecture
the same functionality as the original vector addition loop nest is as follows:



 1: int a[1024], b[1024], c[1024];
 2: int eo, not_eo, k;
 3: /*Port 0 specification: port number, increment, xfer size, period, phase start, phase end, start time, end time, r/w */
 4: CIQ_STREAM( 0, 4, 4, 3, 0, 1, 0, 3*BLEN*MAXK+3, 0 );
 5: /*Port 1 specification*/
 6: CIQ_STREAM( 1, 4, 4, 3, 1, 2, 0, 3*BLEN*MAXK+3, 0 );
 7: /*Port 2 specification*/
 8: CIQ_STREAM( 2, 4, 4, 3, 2, 3, 0, 3*BLEN*MAXK+3, 1 );
 9: BIQ_SET_MAT(0, &b[0], BLEN*4, 4);
10: BIQ_SET_MAT(1, &c[0], BLEN*4, 4);
11: BIQ_SET_MAT(2, &a[0], BLEN*4, 4);
12: BIQ_SET_BAT(0, 0x0000, BLEN*4); BIQ_SET_BAT(1, 0x0100, BLEN*4);
13: BIQ_SET_BAT(2, 0x0200, BLEN*4); BIQ_SET_BAT(3, 0x0300, BLEN*4);
14: BIQ_SET_BAT(4, 0x0400, BLEN*4); BIQ_SET_BAT(5, 0x0500, BLEN*4);
15: for( k = 0; k < MAXK; k++ )
16: {
17:     /*Even or odd iteration? For double buffering*/
18:     eo = k&0x1;
19:     CIQ_LXD(2);
20:     CIQ_SA(0, (BLEN*4*eo));
21:     CIQ_SA(1, ((2*BLEN*4)+BLEN*4*eo));
22:     CIQ_SA(2, ((4*BLEN*4)+BLEN*4*eo));
23:     /*Start Chess*/
24:     CIQ_ST(3*BLEN);
25:     CIQ_XSI(1);
26:     /*BB stuff*/
27:     /*Load A*/
28:     BIQ_FLB(0,eo);
29:     /*Load B*/
30:     BIQ_FLB(2,2+eo);
31:     BIQ_LXI(2);
32:     if ( k >= 1 )
33:     {
34:         not_eo = (eo==0)?1:0;
35:         BIQ_XSD(1);
36:         BIQ_FSB(4, 4+not_eo);
37:     }
38: }
39: eo = MAXK & 0x1;
40: not_eo = (eo==0)?1:0;
41: BIQ_XSD(1);
42: BIQ_FSB(4, 4+not_eo);



In this arrangement, three ports are used in coprocessor controller 9: one for each 
input vector (b and c) and one for the output vector (a). The statements at lines 4, 6 
and 8 are code macros to initialise these three ports. These, when expanded, result in 
the following commands (this is with reference to line 4 - the other expanded macros 

are directly analogous): 
    CC_CURRENT_PORT(0);
    CC_PORT_INCREMENT(4);
    CC_TRANSFER_SIZE(4);
    CC_PORT_PERIOD(3);
    CC_PORT_PHASE_START(0);
    CC_PORT_PHASE_END(1);
    CC_PORT_START_TIME(0);
    CC_PORT_END_TIME(3*BLEN*MAXK+3);
    CC_PORT_IS_WRITE(0);

This code has the effect that port 0 will read 4 bytes of data every 3rd tick of counter
42, and precisely at ticks 0, 3, 6 ... 3*BLEN*MAXK+3, and will increment the
address it reads from by 4 bytes each time. BLEN*MAXK is the length of the two 
vectors to sum (in this case, 1024), and BLEN is the length of a single burst of data 
from DRAM (say, 64 bytes). With these values, MAXK will be set to 1024/64=16. 
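
The periodic port behaviour described above can be sketched in C. The activation rule used here is an assumption chosen to reproduce the stated ticks 0, 3, 6 ... for port 0: a port is taken to be active when the tick lies between its start and end times and the tick modulo the period falls in the half-open interval from phase start to phase end. The structure and function names are hypothetical.

```c
/* Illustrative image of the per-port settings; not the hardware layout. */
typedef struct {
    unsigned period, phase_start, phase_end;  /* in ticks */
    unsigned time_start, time_end;            /* first and last active cycle */
    unsigned address, increment;              /* in bytes */
    int is_write;                             /* 0 = read, 1 = write */
} port_cfg;

/* Assumed rule: active inside [time_start, time_end] when
 * (tick mod period) lies in [phase_start, phase_end). */
int port_is_active(const port_cfg *p, unsigned tick)
{
    unsigned phase;
    if (p->period == 0 || tick < p->time_start || tick > p->time_end)
        return 0;
    phase = tick % p->period;
    return phase >= p->phase_start && phase < p->phase_end;
}

/* Returns the address used at this tick and then advances it by the
 * port increment, or returns -1 if the port is idle at this tick. */
long port_step(port_cfg *p, unsigned tick)
{
    long used;
    if (!port_is_active(p, tick))
        return -1;
    used = (long)p->address;
    p->address += p->increment;
    return used;
}
```

With a period of 3 and phases 0 to 1, 1 to 2 and 2 to 3 for the three ports, at most one port is active on any tick under this rule, which matches the requirement that no two streams access the bus at the same time.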

Lines 9 to 14 establish MATs and BATs for the burst buffers transfers, tying entries in 
these tables to addresses in main memory 3 and burst buffers memory 5. The 
command BIQ_SET_MAT(0, &b[0], BLEN*4, 4, TRUE) is a code macro that is 
expanded into BB_SET_MAT(0, &b[0], BLEN*4, 4) and ties entry 0 in the MAT to 
address &b[0], sets the burst length to be BLEN*4 bytes (that is, BLEN integers, if an 
integer is 32 bits) and the stride to 4. The two lines that follow are similar and relate to 
c and a. The line BIQ_SET_BAT(0, 0x0000, BLEN*4) is expanded to 
BB_SET_BAT(0, 0x0000, BLEN*4) and ties entry 0 of the BAT to address 0x0000 
in the burst buffers memory 5. The two lines that follow are again similar.

Up to this point, no computation has taken place; however, coprocessor controller 9
and burst buffers controller 7 have been set up. The loop nest at lines 15 to 38 is
where the actual computation takes place. This loop is repeated MAXK times, and
each iteration operates on BLEN elements, giving a total of MAXK*BLEN elements
processed. The loop starts with a set of instructions CIQ_xxx sent to the coprocessor
instruction queue 8 to control the activity of the coprocessor 2 and coprocessor
controller 9, followed by a set of instructions sent to the burst instruction queue 6
whose purpose is to control the burst buffers controller 7 and the burst buffers
memory 5. The relative order of these two sets is in principle unimportant, because
the synchronisation between the different system elements is guaranteed explicitly by
the semaphores. It would even be possible to have two distinct loops running after
each other (provided that the two instruction queues were deep enough), or to have
two distinct threads of control.

The CIQ_xxx lines are code macros that simplify the writing of the source code. Their
meaning is the following:

CIQ_LXD(N) inserts N CC_LXS_DECREMENT instructions in the
coprocessor instruction queue 8;

CIQ_SA(port, address) inserts a CC_CURRENT_PORT(port) and a
CC_PORT_ADDRESS(address) instruction in the coprocessor instruction
queue 8;

CIQ_ST(cycleno) inserts a CC_EXECUTE_START(cycleno) instruction in
order to let the coprocessor 2 execute for cycleno ticks of counter 42; and

CIQ_XSI(N) inserts N CC_XSS_INCREMENT instructions in the
coprocessor instruction queue 8.

The net effect of the code shown above is to:

synchronise with a corresponding loadburst on the LXS semaphore;

start the computation on coprocessor 2 for 3*BLEN ticks of counter 42; and

synchronise with a corresponding storeburst on the XSS semaphore.

The BIQ_xxx lines are again code macros that simplify the writing of the source code.
Their meaning is as follows:

BIQ_FLB(mate,bate) inserts a BB_LOADBURST(mate, bate, TRUE)
instruction into the burst instruction queue 6;

BIQ_LXI(N) inserts N BB_LX_INCREMENT instructions in the burst
instruction queue 6;

BIQ_FSB(mate,bate) inserts a BB_STOREBURST(mate, bate, TRUE)
instruction into the burst instruction queue 6; and

BIQ_XSD(N) inserts N BB_XS_DECREMENT instructions in the burst
instruction queue 6.

The net effect of the code shown above is to load two bursts from main DRAM
memory 3 into burst buffers memory 5, and then to increase the value of the LX
semaphore 10 so that the coprocessor 2 can start its execution as described above. In
all iterations but the first one, the results of the computation of coprocessor 2 are then
stored back into main memory 3 using a storeburst instruction. It is not strictly
necessary to wait for the second iteration to store the result of the computation
executed in the first iteration, but this enhances the parallelism between the
coprocessor 2 and the burst buffers memory 5.

The use of the two variables eo and not_eo is a mechanism used here to allow the 
double-buffering effect described previously. 
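
The even/odd selection can be made concrete with a short sketch that mirrors the arguments passed to CIQ_SA in the listing. The function names are hypothetical. With a BLEN of 64 elements, the offsets produced coincide with the BAT addresses 0x0000 to 0x0500 used in the listing, although the text itself does not state that this value is intended.

```c
/* Byte offset in burst buffer memory used by a port on iteration k,
 * following the CIQ_SA arguments: ports 0, 1 and 2 use the regions
 * starting at 0, 2*BLEN*4 and 4*BLEN*4, each split into two halves. */
unsigned port_buffer_offset(unsigned port, unsigned k, unsigned blen)
{
    unsigned eo = k & 0x1u;  /* 0 on even iterations, 1 on odd ones */
    return (2u * port * blen * 4u) + blen * 4u * eo;
}

/* Half of a buffer pair that was filled on the previous iteration;
 * this is the value called not_eo in the listing. */
unsigned previous_half(unsigned k)
{
    unsigned eo = k & 0x1u;
    return (eo == 0u) ? 1u : 0u;
}
```

Because consecutive iterations alternate between the two halves, a store of the previous result and a load of the next input never target the half that the coprocessor is currently using.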

Lines 39 to 42 perform the last burst transfer to main memory 3 from burst buffers
memory 5, compensating for the absence of a storeburst instruction in the first
iteration of the loop body.

The resulting timeline is as shown in Figure 6. Loadbursts 601 are the first activity
(as until these are completed the coprocessor 2 is stalled by the load/execute
semaphore), and when these are completed the coprocessor 2 can begin to execute
602. The next instruction in the burst instruction queue 6 is another loadburst 601,
which is carried out as soon as the first two loads have finished. Then, the next
instruction in the burst instruction queue 6 is a storeburst 603, which has to wait until
the XS semaphore 11 signals that the first computation on coprocessor 2 has
completed. This process continues throughout the loop.

Although the example indicated above is for a very simple algorithm, it illustrates the 
basic principles required in calculations that are more complex. The person skilled in 
the art could use the approach, principles and techniques indicated above for 
programming the architecture of Figure 1 to adapt more complex algorithms for 
execution by this architecture.

Tool chain for computation

The principles of the computation model can be exploited in straightforward fashion 
by hand coding - that is, manually writing C code to run on the CPU adapted in 
conventional manner to schedule the appropriate operation of the system components 
(to place instructions in the appropriate queues, and to set the system components into 
operation as described), and to provide an appropriate configuration for the 
coprocessor in accordance with the standard synthesis tools for configuring that 
coprocessor. For a configurable or FPGA-based processor like CHESS, this tool will 
generally be a hardware description language. An appropriate hardware description 
language to use for CHESS is JHDL, described in, for example, "JHDL - An HDL for 
Reconfigurable Systems" by Peter Bellows and Brad Hutchings, Proceedings of the
IEEE Symposium on Field-Programmable Custom Computing Machines, April 1998. 

A preferred alternative is for a specific toolchain to be used for this computational
architecture. The elements of such a toolchain and its practical operation are
described briefly below.

The toolchain has the function of converting conventional sequential code to code
adapted specifically for effective operation, and interoperation, of the system
components. The exemplary toolchain receives as input C code, and provides as
output the following:

a CHESS coprocessor configuration for execution of the computation;

a burst buffer schedule for moving data between the system memory and the
burst buffer memory; and

a coprocessor controller configuration for moving data between the CHESS
coprocessor and the burst buffer memory.

The toolchain itself has two components. The first is a frontend, which takes C code 
as its input and provides annotated dependence graphs as its output. The second 
component is a backend, which takes the dependence graphs generated by the 
frontend, and produces from these the CHESS configuration, the burst buffers 
schedule, and the coprocessor controller configuration. 

The main task of the frontend is to generate a graph which aptly describes the 
computation as it is to happen in coprocessor 2. One of the main steps performed is 
value-based dependence analysis, as described in W. Pugh and D. Wonnacott, "An 
Exact Method for Analysis of Value-based Array Data Dependences", University of 
Maryland, Institute for Advanced Computer Studies - Dept. of Computer Science, 
University of Maryland, December 1993 . The output generated is a description of 
the dataflow to be implemented in the CHESS array and a representation of all the 
addresses that need to be loaded in as inputs (via loadburst instructions) or stored to 
as outputs (via storeburst instructions), and of the order in which data has to be 
retrieved from or stored to the main memory 3. This is the basis upon which an 
efficient schedule for the burst buffers controller 7 will be derived. 

If we assume, as an example, the C code for a 4-tap FIR filter: 

    int i, j, src[], kernel[], dst[];
    for( i = 0; i < 1000; i++ )
        for( j = 0; j < 4; j++ )
            dst[i] = dst[i] + src[4 + i-j]*kernel[j];

as the input to the frontend, the output, provided as a text file, will have the following
form:

    loop:0<=i<999                          #loop nest description
    loop:0<=j<4
    16:str/0/0/20/                         #store instruction
      LOD:
      #Array:d[1/0/0] at line 11
    20:ldc/16/0/0/                         #load constant
    22:str/0/0/26/                         #store instruction, which
      LOD: 4 <= j                          #writes its outputs to main
      #Array:d[1/0/0] at line 13           #memory if 4<=j
    26:add/22/27/31/                       #addition
    27:lod/26/0/0/                         #load instruction, taking its inputs
      Dep(16)= [0][0] / Range: j <= 0      #from instruction 16 if j<=0
      Dep(22)= [0][1] / Range: 1 <= j      #from instruction 22 otherwise
      LID:
      #Array:d[1/0/0] at line 13
    31:mul/26/32/37/                       #multiplication
    32:lod/31/0/0/                         #load instruction
      Dep(32)= [1][1] / Range: 1 <= i && 1 <= j
      LID: i <= 0 || j <= 0 && 1 <= i      #which takes its inputs from
      #Array:src[1/-1/0] at line 13        #memory if i <= 0 || j <= 0 && 1 <= i
    37:lod/31/0/0/
      Dep(37)= [1][0] / Range: 1 <= i      #load instruction
      LID: i <= 0                          #taking its inputs from main memory if
      #Array:kernel[0/1/0] at line 13      #i<=0

This text file is a representation of an annotated graph. The graph itself is shown in
Figure 7. The graph clearly shows the dependencies found by the frontend algorithm.
Edges 81 are marked with the condition under which a dependence exists, and the
dependence distance where applicable. The description provided contains all the
information necessary to generate a hardware component with the required
functionality.

The backend of the compilation toolchain has certain basic functions. One is to
schedule and retime the extended dependence graph obtained from the frontend. This
is necessary to obtain a fully functional CHESS configuration. Scheduling involves
determining a point in time for each of the nodes 82 in the extended dependence graph
to be activated, and retiming involves, for example, the insertion of delays to ensure
that edges propagate values at the appropriate moment. Scheduling can be performed
using shifted-linear scheduling, a technique widely used in hardware synthesis.
Retiming is a common and quite straightforward task in hardware synthesis, and
merely involves adding an appropriate number of registers to the circuit so that
different paths in the circuit meet at the appropriate point in time. At this point, we
have a complete description of the functionality of the coprocessor 2 (here, a CHESS
coprocessor). This description is shown in Figure 8. This description can then be
passed on to the appropriate tools to generate the sequence of signals (commonly
referred to as "bitstream") necessary to program the CHESS coprocessor with this
functionality.

Another function required of the backend is generation of the burst buffer and
coprocessor controller schedule. Once the CHESS configuration has been obtained, it
is apparent when it needs to be fed with values from main memory and when values
can be stored back to main memory, and the burst buffer schedule can be established.
Accordingly, a step is provided which involves splitting up the address space of all the
data that needs to be loaded into or stored from the burst buffers memory 5 into fixed
bursts of data that the burst buffers controller 7 is able to act upon.

For instance, in the FIR example just presented, the input array (src[]) is split into
several bursts of appropriate sizes, such that all the address range needed for the
algorithm is covered. This toolchain uses bursts of length Blen (where Blen is a power
of 2, and is specified as an execution parameter to the toolchain) to cover as much of
the input address space as possible. When no more can be achieved with this burst
length, the toolchain uses bursts of decreasing lengths: Blen/2, Blen/4, Blen/8, ..., 2, 1
until every input address needed for the algorithm belongs to one and only one burst.
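
The splitting rule can be sketched as a greedy cover. This is a simplified illustration, assuming that the addresses needed form one contiguous range counted in elements. The type and function names are hypothetical and the sketch is not the toolchain's actual routine.

```c
#include <stddef.h>

/* One burst: first element index and number of elements. */
typedef struct {
    unsigned start;
    unsigned length;
} burst;

/* Cover [start, start + count) with bursts of length max_len, then
 * max_len/2, max_len/4, ..., 1, so that every element belongs to exactly
 * one burst. max_len is assumed to be a power of 2. Returns the number of
 * bursts written, or 0 if nothing was written or out[] is too small. */
size_t split_into_bursts(unsigned start, unsigned count, unsigned max_len,
                         burst *out, size_t out_cap)
{
    size_t n = 0;
    unsigned len;
    for (len = max_len; len >= 1 && count > 0; len /= 2) {
        while (count >= len) {
            if (n == out_cap)
                return 0;
            out[n].start = start;
            out[n].length = len;
            n++;
            start += len;
            count -= len;
        }
    }
    return n;
}
```

For example, a range of 23 elements with a maximum burst length of 8 is covered by bursts of 8, 8, 4, 2 and 1 elements.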

For each one of these bursts, the earliest point in the iteration space in which any of
the data loaded is needed is computed. In other words, to each input burst there is
associated one point in the iteration space for which it is guaranteed that no earlier
iterations need any of the data loaded by the burst. It is easy to detect when the
execution of the coprocessor 2 would reach that point in the iteration space. There are
thus created:

a loadburst instruction for the relevant addresses, in order to move data into
burst buffer memory 5; and

a corresponding synchronisation point (a CC_LX_DECREMENT /
BB_LX_INCREMENT pair) to guarantee that the execution of coprocessor 2
is synchronised with the relevant loadburst instruction.

To achieve an efficient overlap of computation and communication, the loadburst 
instruction has to be issued in advance, in order to hide the latency associated with the 
transfer of data over the bus. 

All the output address space that has to be covered by the algorithm is partitioned into
output bursts, according to a similar logic. Again, the output space is partitioned into
bursts of variable length.



The toolchain creates:

a storeburst instruction for the relevant addresses; and

a corresponding synchronisation point (a BB_XS_DECREMENT /
CC_XS_INCREMENT pair).

At this point, we possess information relevant to:

the relative ordering of loadburst and storeburst instructions, and their
parameters of execution (addresses, etc.); and

their position relative to the computation to be performed on coprocessor 2.

This information is then used to generate appropriate C code to organise the overall
computation, as in the FIR example described above.

The actual code generation phase (that is, the emission of the C code to run on
processor 1) can be accomplished using the code generation routines contained in the
Omega Library of the University of Maryland, available at
http://www.cs.umd.edu/projects/omega/, followed by a customised script that
translates the generic output of these routines into the form described above.

Experimental Results - Image Convolution

An image convolution algorithm is described by the following loop nest:



for (i = 0; i < IMAGE_HEIGHT; i++)
    for (j = 0; j < IMAGE_WIDTH; j++)
        for (k = 0; k < KERNEL_HEIGHT; k++)
            for (l = 0; l < KERNEL_WIDTH; l++)
                Dest[i][j] += Source[(i + 1) - k][(j + 1) - l] * C[k][l];

Replication has been used to enlarge the source image by KERNEL_HEIGHT-1 pixels in
the vertical direction and KERNEL_WIDTH-1 pixels in the horizontal direction in order
to simplify boundary conditions. Two kernels are used in evaluating system
performance: a 3x3 kernel and a 5x5 kernel, both performing median filtering.
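A self-contained sketch of the loop nest together with the border replication is given below. The image and kernel sizes, the helper names clamp, replicate_border and convolve, and the way the padded image is indexed (an offset of KERNEL_HEIGHT-1 and KERNEL_WIDTH-1 so that no index is negative) are assumptions made for illustration; the experiments described here may have arranged the padding differently.

```c
enum {
    IMAGE_HEIGHT = 4, IMAGE_WIDTH = 4,
    KERNEL_HEIGHT = 3, KERNEL_WIDTH = 3,
    /* Source enlarged by KERNEL_HEIGHT-1 rows and KERNEL_WIDTH-1 columns. */
    PADDED_HEIGHT = IMAGE_HEIGHT + KERNEL_HEIGHT - 1,
    PADDED_WIDTH = IMAGE_WIDTH + KERNEL_WIDTH - 1
};

static int clamp(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Build the enlarged source image by replicating the edge pixels. */
void replicate_border(int image[IMAGE_HEIGHT][IMAGE_WIDTH],
                      int padded[PADDED_HEIGHT][PADDED_WIDTH])
{
    for (int r = 0; r < PADDED_HEIGHT; r++) {
        for (int c = 0; c < PADDED_WIDTH; c++) {
            int sr = clamp(r - (KERNEL_HEIGHT - 1) / 2, 0, IMAGE_HEIGHT - 1);
            int sc = clamp(c - (KERNEL_WIDTH - 1) / 2, 0, IMAGE_WIDTH - 1);
            padded[r][c] = image[sr][sc];
        }
    }
}

/* The four-deep loop nest; no boundary tests are needed inside it. */
void convolve(int padded[PADDED_HEIGHT][PADDED_WIDTH],
              int kernel[KERNEL_HEIGHT][KERNEL_WIDTH],
              int dest[IMAGE_HEIGHT][IMAGE_WIDTH])
{
    for (int i = 0; i < IMAGE_HEIGHT; i++) {
        for (int j = 0; j < IMAGE_WIDTH; j++) {
            int acc = 0;
            for (int k = 0; k < KERNEL_HEIGHT; k++)
                for (int l = 0; l < KERNEL_WIDTH; l++)
                    acc += padded[i + (KERNEL_HEIGHT - 1) - k]
                                 [j + (KERNEL_WIDTH - 1) - l] * kernel[k][l];
            dest[i][j] = acc;
        }
    }
}
```

Because the enlarged source absorbs every out-of-range access, the inner loops contain only affine array accesses, which is the property the burst partitioning relies on.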

Figures 9 and 10 illustrate the performance of the architecture according to an
embodiment of the invention (indicated as BBC) as against a conventional processor
using burst buffers (indicated as BB) and a conventional processor-and-cache
combination (indicated as Cache). Two versions of the algorithm were implemented,
one with 32-bit pixels and one with 8-bit pixels. The same experimental
measurements were taken for different image sizes, ranging from 8x8 to 128x128, and
for different burst lengths.

As can be seen from the Figures, the BBC implementation showed a great
performance advantage over the BB and the Cache implementations. The algorithm is
relatively complex, and the overall performance of the system in both the BB and Cache
implementations is heavily compute-bound: the CPU simply cannot keep up because
of the high complexity of the algorithm. Using embodiments of the invention, in
which the computation is vastly more effective as it is carried out on the CHESS array
(with its inherent parallelism), the performance is if anything I/O-bound, even though
I/O is also efficient through effective use of burst buffers. Multimedia instructions
(such as MIPS MDMX) could improve the performance of the CPU in the BB or the
Cache implementations, as they can allow for some parallel execution of arithmetic
instructions. Nonetheless, the resulting performance enhancement is unlikely to reach
the performance levels obtained using a dedicated coprocessor in this arrangement.

Modifications and Variations

The function of decoupling the processor 1 from the coprocessor 2 and the burst
buffer memory 5 can be achieved by means other than the instruction queues 6, 8. An
effective alternative is to replace the two queues with two small processors (one for
each queue) fully dedicated to issuing instructions to the burst buffer memory 5 and
the coprocessor 2, as described in Figure 12. The burst instruction queue is replaced
(with reference to the Figure 1 embodiment) by a burst command processor 106, and
the coprocessor instruction queue is replaced by a coprocessor command processor
108. Since this would be the only task carried out by these two components, there
would be no need for them to be decoupled from the coprocessor 2 and the burst
buffers 7 respectively. Each of the command processors 106, 108 could operate by
issuing a command to the coprocessor or burst buffers (as appropriate), doing
nothing until that command has completed its execution, then issuing another command,
and so on. This would complicate the design, but would free the main processor 1
from its remaining trivial task of issuing instructions into the queues. The only work
to be carried out by processor 1 would then be the initial setting up of these two
processors, which would be done just before the beginning of the computation.
During the computation, the processor 1 would thus be completely decoupled from
the execution of the coprocessor 2 and the burst buffer memory 5.

Two conventional, but smaller, microprocessors (or, alternatively, only one processor
running two independent threads of control) could be used, each one of them running
the relevant part of the appropriate code (loop nest). Alternatively, two general state
machines could be synthesised whose external behaviour would reflect the execution
of the relevant part of the code (that is, they would provide the same sequence of
instructions). The hardware complexity and cost of such state machines would be
significantly smaller than that of the equivalent dedicated processors. Such state
machines would be programmed by the main processor 1 in a way similar to that
described above. The main difference would be that the repetition of events would be
encoded as well: this is necessary for processor 1 to be able to encode the behaviour
of one algorithm in a few (if complex) instructions. In order to obtain the repetition of
an event x times, the processor 1 would not have to insert x instructions in a queue,
but would have to encode this repetition parameter in the instruction definition.
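The difference between queueing x instructions and encoding a repetition parameter can be sketched as follows. The structure repeated_instr_t, its fields and the function expand_repeats are hypothetical and serve only to illustrate the idea: each programmed entry carries a repeat count, and the state machine (modelled here by a simple expansion loop) produces the same instruction sequence that a queue would otherwise have had to hold.

```c
#include <stddef.h>

/* One programmed entry: an instruction code plus a repetition count. */
typedef struct {
    int opcode;        /* hypothetical numeric instruction code */
    unsigned repeat;   /* how many times the instruction is issued */
} repeated_instr_t;

/*
 * Expand a compact program into the flat instruction sequence it
 * stands for. Returns the number of instructions written, which is
 * capped at out_cap.
 */
size_t expand_repeats(const repeated_instr_t *prog, size_t prog_len,
                      int *out, size_t out_cap)
{
    size_t n = 0;

    for (size_t p = 0; p < prog_len; p++) {
        for (unsigned r = 0; r < prog[p].repeat; r++) {
            if (n == out_cap)
                return n;
            out[n++] = prog[p].opcode;
        }
    }
    return n;
}
```

In this sketch the main processor writes prog_len entries once, regardless of how large the repeat counts are, whereas a queue would need one entry per issued instruction.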

As indicated above, a particularly effective mechanism is for finite state machines 
(FSMs) to be used instead of queues to decouple the execution of the main processor 
1 from the execution of coprocessor 2 and the burst buffers controller 7. This 
mechanism will now be discussed in further detail. 

In the architecture illustrated in Figure 1, instructions to drive the execution of 
different I/O streams can be mixed with instructions for execution of coprocessor 2. 
This is possible because the mutual relationships between system components are
known at compile time, and therefore instructions to the different system components 
can be interleaved in the source code in the correct order. 



Two state machines can be built to issue these instructions for execution in much the 
same way. One such state machine would control the behaviour of the coprocessor 2, 
issuing CC_xxx_xxx instructions as required, and the other would control the 
behaviour of burst buffers controller 7, issuing BB_xxx_xxx instructions as required.

Such state machines could be implemented in a number of different ways. One
alternative is indicated in Figure 13. With reference to the vector addition example
presented above, this state machine 150 (for the coprocessor 2, though the equivalent
machine for the burst buffers controller 7 is directly analogous) implements a
sequence of instructions built from the pattern:

CC_LX_DECREMENT,
CC_LX_DECREMENT,
CC_START_EXEC,
CC_XS_INCREMENT.

The main state machine 150 is effectively broken up into simpler state machines 151,
152, 153, each of which controls the execution of one kind of instruction. A period
and a phase (note, this has no relationship to periods and phases which can be
associated with I/O streams communicating between the coprocessor 2 and the burst
buffers controller 7) are assigned to each of the simpler state machines. The hardware
of state machine 150 will typically contain an array of such simpler state machines in
a number sufficient to satisfy the requirements of intended applications.

An event counter 154 is defined. The role of the event counter 154 is to allow
instructions (in this case, for coprocessor 2) to be sent out in sequence. Each time the
event counter 154 is incremented, if there exists a value M such that
M * Period_i + Phase_i = value of Event Counter, the state machine i (i.e. one of the
simpler state machines 151, 152, 153) is chosen for execution through comparison logic
155, and its instruction is executed. It is the responsibility of the application software to
ensure that no two distinct state machines can satisfy this equation. When the
execution of that instruction is completed, the event counter 154 is incremented again.
This sequence of events can be summarised as:



1: Increment event counter: EC++
2: Choose state machine i for execution if there exists an M such that
   M * Period_i + Phase_i = EC
3: If such a state machine i has been found, execute the instruction described by
   state machine i (this could include a suspension operation)
4: Go back to 1
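The four steps above can be simulated in software as a minimal sketch. The type simple_fsm_t, the functions fsm_matches and run_event_counter, and the choice of period 4 with phases 1, 2, 3 and 0 for the vector addition pattern are assumptions made for illustration; M is taken to be a non-negative integer, and instruction execution is treated as instantaneous, whereas the hardware waits for each instruction to complete before incrementing the counter again.

```c
#include <stddef.h>

typedef enum { CC_LX_DECREMENT, CC_START_EXEC, CC_XS_INCREMENT } cc_instr_t;

/* One simple state machine: it issues `instr` whenever it is selected. */
typedef struct {
    unsigned period;   /* must be greater than 0 in this sketch */
    unsigned phase;
    cc_instr_t instr;
} simple_fsm_t;

/* True if some integer M >= 0 satisfies M * period + phase == ec. */
static int fsm_matches(const simple_fsm_t *f, unsigned ec)
{
    return ec >= f->phase && (ec - f->phase) % f->period == 0;
}

/*
 * Run the event counter for n_events increments and record the
 * instructions issued. Returns the number of instructions recorded
 * (at most cap).
 */
size_t run_event_counter(const simple_fsm_t *fsms, size_t n_fsms,
                         unsigned n_events,
                         cc_instr_t *trace, size_t cap)
{
    unsigned ec = 0;
    size_t issued = 0;

    for (unsigned e = 0; e < n_events; e++) {
        ec++;                                      /* step 1 */
        for (size_t i = 0; i < n_fsms; i++) {      /* step 2 */
            if (fsm_matches(&fsms[i], ec)) {
                if (issued == cap)
                    return issued;
                trace[issued++] = fsms[i].instr;   /* step 3 */
                break;  /* software guarantees at most one match */
            }
        }
    }                                              /* step 4: loop */
    return issued;
}
```

With four simple state machines of period 4 and phases 1, 2, 3 and 0, issuing CC_LX_DECREMENT, CC_LX_DECREMENT, CC_START_EXEC and CC_XS_INCREMENT respectively, the trace repeats the pattern given for the vector addition example.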

A few extra parameters relevant to the execution of an instruction (addresses to read
from/write to, length of execution for a CC_START_EXEC, etc.) will have to be
encoded in the state machine 150. It should also be noted that more than one state
machine can issue a given instruction, typically with different parameters.

This system works particularly well to generate periodic behaviour. However, if an
event has to happen only once, it can readily be encoded in a simple state machine
with infinite period and finite phase, the only consequence being that this simple state
machine will be used only the once.

This approach can itself be varied. For example, to add flexibility to [...]
predetermined 'time window'.

The programming of these state machines would happen during the [...]. An
alternative would be the loading of all the parameters necessary to [...]
machines from a predefined region of main memory 3 [...] the use of a dedicated
channel and a Direct Memory Access (DMA) mechanism.

The other alternative mechanism suggested, of using two dedicated [...], would
require no significant modification to the programming model or the [...] of
Figure 1: the same techniques used to program main processor 1 could [...] or
other DRAM, adding to the [...] of the system. The cost [...] of the system would
also be increased by adding (and underutilising, [...] present to perform very
simple computations) two microprocessors in this way.

[...] conditionals/unknown execution times, and non-affine accesses to memory.

Pipeline architectures have value where applications require more than one [...]
this kind of arrangement, changes to both the [...] model will be required.
Architecturally, successive buffered CHESS arrays could be provided, or a larger
partitioned CHESS array, or a CHESS array reconfigured between computational
stages. [...] pipeline architectures effective to handle such [...] involving plural
CHESS arrays [...].

Figure 11A shows an arrangement [...] from a processor 143 and a main memory 144,
where a CHESS array 141 receives data from a [...] and passes it to a second set of
burst buffers 145, this second [...] interacting with a further CHESS array [...].
[...] sets of CHESS arrays and burst [...] and involves communication between
adjacent [...] followed to allow efficient [...]; semaphores could be used to [...]
by successive stages of the pipeline.

Figure 11B shows a different [...] with an SRAM cache 155 between two CHESS
arrays 151, [...]. [...] role [...] buffers 152 and [...] provided by a second set
[...] processor 153 and of the main memory [...]. In this arrangement, [...]
synchronisation may be less [...] parallelism less effectively.

One constraint on efficient use of the coprocessor in an architecture as described
above is that the execution time of [...] implementation should be known (to allow
efficient scheduling). [...] loops. However, if execution times [...], the scheduling
requirements in the toolchain need to be [...] and appropriate allowances need to
[...] processor 1, the coprocessor and the burst buffers. [...] specific configuration
for this circumstance.

[...] extension is to [...] buffers [...]. [...] all access [...] model allows the
coprocessor controller and [...] of logical streams. The significance of this to [...]
unclear how non-affine access could be provided in [...] synchronisation mechanisms
would appear to break down. [...] non-affine array accesses [...] loading lookup
tables into burst buffers and [...] could [...] by a [...] advance to the time that
they will modify [...] logical stream [...] to the synchronisation mechanism) and to
modify [...] mechanism to support this type of recursive reference.

Many variations and extensions to the architecture of Figure 1 can thus be carried out
without deviating from the invention as claimed.



